Left Censoring as a Cause for Missingness

Figure S1. Percentage of missing values in each dataset compared to the p-values from the left-censored binomial test.

Simulated Data

Simple Data Set

Figure S2. Comparison of ICI-Kt and Pearson correlations for perfectly positively and perfectly negatively correlated samples as values are systematically replaced with NA. NA values returned by Pearson correlation were replaced with zero for this comparison.

We can also examine the full set of positive and negative correlations generated as we vary the number of missing entries between two positively correlated samples and two negatively correlated samples. These distributions are shown in Figures S3 and S4. The distributions from ICI-Kt and Kendall-tau are identical, which is expected because missing values (NA) are replaced with zero within the ICI-Kt code and were replaced with zero before calculating the Kendall-tau correlations.
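The agreement between the two approaches follows from Kendall-tau being rank-based: any fill value that lies below every observed value produces the same ordering, and therefore the same correlation. The following minimal sketch illustrates this with a plain tau-a calculation. It is not the ICI-Kt implementation; the function names, the use of `None` for missing values, and the example data (all observed values positive) are assumptions made for illustration.

```python
def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Pairs tied in x or y contribute zero to the score.
    """
    n = len(x)
    score = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


def fill_missing(values, fill):
    """Replace missing entries (None, an assumed stand-in for NA) with `fill`."""
    return [fill if v is None else v for v in values]


# Hypothetical positively correlated samples with one missing entry each.
x = [None, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.5, None, 3.5, 4.5, 5.5, 6.5]

# Zero and a tiny positive value are both below every observed value,
# so the ranks, and hence tau, are unchanged.
tau_zero = kendall_tau_a(fill_missing(x, 0.0), fill_missing(y, 0.0))
tau_small = kendall_tau_a(fill_missing(x, 1e-6), fill_missing(y, 1e-6))
print(tau_zero, tau_small)
```

The equivalence only holds while the fill value stays below the smallest observed value; a fill inside the observed range would change the ranks.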

Figure S3. ICI-Kendall-tau correlation as missing values are varied between two samples.

Figure S4. Kendall-tau correlation as missing values are varied between two samples and replaced with 0 before calculating Kendall-tau.

Comparison to Other Correlation Measures

Figure S5. Difference between the correlation estimated after introducing missingness and the reference correlation (1 for the positive case, -1 for the negative case), as a function of the average number of missing entries in samples X and Y (# NA). Points are colored by the average number of missing points across the two samples X and Y.

Figure S6. Comparison of the correlation values obtained by Pearson, Kendall, and ICI-Kt correlation as an increasing number of values (0 - 500) in the bottom half of either sample is set to missing, for both positively (correlation = 1) and negatively (correlation = -1) correlated samples, using an alternative random-number seed to generate the missing locations. Points are colored by how many points were set to missing on average between the two samples. A subset of 10,000 points was used for visualization.

Figure S7. Difference between the correlation estimated after introducing missingness and the reference correlation (1 for the positive case, -1 for the negative case), as a function of the average number of missing entries in samples X and Y (# NA), using an alternative random-number seed to generate the missing locations. Points are colored by the average number of missing points across the two samples X and Y.

Semi-Realistic Data Set

Figure S8. Effect of introducing missing values from a cutoff (A & B) or randomly (C) on different measures of correlation, including ICI-Kt, Kendall with pairwise complete, Kendall replacing missing with 0, Pearson with pairwise complete, and Pearson replacing missing with 0. A) Missing values introduced by setting an increasing cutoff. B) Missing values introduced by setting an increasing cutoff, and then log-transforming the values before calculating correlation. C) Missing values introduced at random. For the random case, each sample of random positions was repeated 100 times.
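The two ways of introducing missingness described in this caption can be sketched as follows. This is an illustrative sketch only, not the code used to generate the figure; the function names, the `None` marker for missing values, the example values, and the seed are all assumptions.

```python
import random


def censor_below(values, cutoff):
    """Left-censor: mark every value below `cutoff` as missing (None)."""
    return [None if v < cutoff else v for v in values]


def censor_random(values, n_missing, rng):
    """Mark `n_missing` randomly chosen positions as missing (None)."""
    positions = set(rng.sample(range(len(values)), n_missing))
    return [None if i in positions else v for i, v in enumerate(values)]


# Hypothetical sample values.
values = [0.5, 1.2, 3.4, 7.8, 9.1, 15.0]

# Missingness from a cutoff depends on the magnitude of each value.
by_cutoff = censor_below(values, cutoff=2.0)

# Random missingness is independent of the magnitude of each value.
at_random = censor_random(values, n_missing=2, rng=random.Random(1234))

print(by_cutoff)
print(at_random)
```

Raising the cutoff removes progressively more of the lowest values, whereas the random case removes values regardless of their magnitude.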

Changes In Correlation Due to Changes in Dynamic Range and Imputation

Figure S9. Number of missing values in each of the 100 samples as a function of changing the dynamic range of the values in the sample by increasing the lower limit of detection.

Outlier Detection

Table S1. The statistical results of the pairwise comparisons of each method based on the significant fractions after removing outliers. P-values were adjusted using the Bonferroni method.

| Comparison | Difference | P-Value | P-Adjusted |
|---|---|---|---|
| icikt v original | 0.0137 | 7.9e-13 | 3.5e-11 |
| icikt_complete v original | 0.013 | 1.5e-11 | 6.9e-10 |
| icikt + pearson v original | 0.0105 | 1.5e-10 | 6.6e-09 |
| icikt_complete + pearson v original | 0.0103 | 5.0e-10 | 2.2e-08 |
| kt_base v original | 0.0104 | 1.1e-09 | 4.9e-08 |
| pearson_log1p v original | 0.011 | 2.7e-09 | 1.2e-07 |
| pearson_log v original | 0.00964 | 4.1e-09 | 1.8e-07 |
| icikt v pearson_base | 0.00819 | 3.8e-08 | 1.7e-06 |
| icikt v pearson_base_nozero | 0.00949 | 5.5e-08 | 2.5e-06 |
| icikt_complete v pearson_base_nozero | 0.00888 | 1.5e-06 | 6.8e-05 |
| icikt_complete v pearson_base | 0.00758 | 3.6e-06 | 1.6e-04 |
| pearson_base_nozero v kt_base | -0.00625 | 2.7e-05 | 1.2e-03 |
| pearson_base_nozero v icikt + pearson | -0.00629 | 6.1e-05 | 2.7e-03 |
| pearson_base_nozero v pearson_log1p | -0.00687 | 1.0e-04 | 4.7e-03 |
| pearson_base_nozero v icikt_complete + pearson | -0.00618 | 1.4e-04 | 6.5e-03 |
| pearson_base v icikt + pearson | -0.00499 | 1.6e-04 | 7.1e-03 |
| pearson_base_nozero v pearson_log | -0.00547 | 2.0e-04 | 9.0e-03 |
| pearson_base v kt_base | -0.00495 | 2.3e-04 | 1.0e-02 |
| pearson_base v original | 0.00547 | 2.6e-04 | 1.2e-02 |
| pearson_base v pearson_log1p | -0.00558 | 2.8e-04 | 1.3e-02 |
| pearson_base v icikt_complete + pearson | -0.00488 | 5.0e-04 | 2.2e-02 |
| icikt v icikt + pearson | 0.0032 | 5.1e-04 | 2.3e-02 |
| icikt_complete v icikt_complete + pearson | 0.0027 | 5.2e-04 | 2.3e-02 |
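For readers who want to check the adjusted values, the Bonferroni correction multiplies each raw p-value by the number of comparisons and caps the result at 1. The sketch below assumes 45 comparisons. That number is not stated in the source; it is inferred from the ratio of adjusted to raw p-values in Table S1 and is consistent with all pairwise comparisons among the ten methods named in the table. Because the tabulated p-values are rounded, recomputed values match only approximately.

```python
def bonferroni(p_values, n_comparisons):
    """Bonferroni adjustment: multiply each p-value by the number of
    comparisons and cap the result at 1."""
    return [min(1.0, p * n_comparisons) for p in p_values]


# ASSUMPTION: 45 comparisons, inferred from Table S1 (not stated in the source).
N_COMPARISONS = 45

# Rounded raw p-values copied from three rows of Table S1.
raw = [7.9e-13, 2.0e-04, 5.2e-04]
adjusted = bonferroni(raw, N_COMPARISONS)

for p, p_adj in zip(raw, adjusted):
    print(f"{p:.1e} -> {p_adj:.1e}")
```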

Feature-Feature Network Partitioning

Figure S10. Boxplot and sina plots of paired differences of partitioning ratios across datasets. Red line indicates zero difference. Differences are calculated as \(method_{1} - method_{2}\).

Performance

Figure S11. Time in seconds needed as a function of the number of features, with a fitted line for the assumed complexity for each of the methods tested, including R’s Pearson correlation, the ICI-Kt mergesort, and R’s Kendall-tau correlation algorithm.
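To illustrate why a mergesort-based approach scales better than a direct comparison of all pairs, the sketch below computes Kendall-tau by sorting on one variable and counting inversions in the other with merge sort, which takes O(n log n) time rather than O(n^2). This is a simplified illustration, not the ICI-Kt implementation: it assumes no ties and no missing values, and the function names are hypothetical.

```python
def count_inversions(seq):
    """Return (sorted copy, inversion count) using merge sort."""
    n = len(seq)
    if n <= 1:
        return list(seq), 0
    mid = n // 2
    left, inv_left = count_inversions(seq[:mid])
    right, inv_right = count_inversions(seq[mid:])
    merged = []
    inversions = inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] is smaller than every remaining element of left,
            # so each of those elements forms one inversion with it.
            merged.append(right[j])
            inversions += len(left) - i
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


def kendall_tau_no_ties(x, y):
    """Kendall-tau for paired data, ASSUMING no ties in x or y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    # Order y by increasing x; each inversion in y is then a discordant pair.
    order = sorted(range(n), key=lambda k: x[k])
    y_by_x = [y[k] for k in order]
    _, discordant = count_inversions(y_by_x)
    total_pairs = n * (n - 1) // 2
    # concordant + discordant = total_pairs when there are no ties.
    return 1.0 - 2.0 * discordant / total_pairs


print(kendall_tau_no_ties([1, 2, 3, 4], [1, 3, 2, 4]))
```

Handling ties and missing values requires additional bookkeeping that is omitted here.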

References